{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "213d6cca",
   "metadata": {},
   "source": [
    "# Small Molecule Rediscovery #\n",
    "\n",
    "**In this notebook, we will show how to explore the rediscovery capabilities of GB-BI for small molecules.**\n",
    "\n",
    "Welcome to our notebook on using GB-GI to rediscover small molecules using the Tanimoto similarity and fingerprints! Because GB-BI is a quality-diversity method, the end result of such an optimisation is a collection of high-scoring molecules (i.e. highly similar to the target molecule) with a diversity of physico-chemical properties. So, if you're eager to learn how to use GB-GI's rediscovery capabilities or you want to know how to use different types of fingerprints, you're in the right place. For rediscovery based on descriptors and conformer aggregate similarity, see the dedicated tutorial."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9861ed19-0ed3-40fd-b036-35d678844bcd",
   "metadata": {},
   "source": [
    "## Running GB-BI\n",
    "The most basic usage of GB-BI is to simply run the illuminate.py file, without any extra configurations specficied in the command line. In this case, GB-BI will call on the `config.yaml` file in ../Argenomic-GP/configuration folder by making use of Hydra (an open-source Python framework that handle a hierarchical configuration file which can be  easily be adapted through the command line.)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d1a59a1e-6e9d-400c-bcfe-949843a42b44",
   "metadata": {},
   "outputs": [],
   "source": [
    "python GB-BI.py"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4cfa2c99-a2b3-4a93-9efd-d1026f3cb828",
   "metadata": {},
   "source": [
    "While initially the configuration file seem complex or even intimidating, each of the individual components are very easy to understand and adapt. Later on in this tutorial, we will go into detail for each of the components, but for now we just want to highlight a few key aspects. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "856e6d07",
   "metadata": {},
   "outputs": [],
   "source": [
    "---\n",
    "controller:\n",
    "  max_generations: 100\n",
    "  max_fitness_calls: 1000\n",
    "archive:\n",
    "  name: archive_150_4_25000\n",
    "  size: 150\n",
    "  accuracy: 25000\n",
    "descriptor:\n",
    "  properties:\n",
    "  - Descriptors.ExactMolWt\n",
    "  - Descriptors.MolLogP\n",
    "  - Descriptors.TPSA\n",
    "  - Crippen.MolMR\n",
    "  ranges:\n",
    "  - - 225\n",
    "    - 555\n",
    "  - - -0.5\n",
    "    - 5.5\n",
    "  - - 0\n",
    "    - 140\n",
    "  - - 40\n",
    "    - 130\n",
    "fitness:\n",
    "  type: Fingerprint\n",
    "  target: \"O=C1NC(=O)SC1Cc4ccc(OCC3(Oc2c(c(c(O)c(c2CC3)C)C)C)C)cc4\"\n",
    "  representation: ECFP4\n",
    "arbiter:\n",
    "  rules:\n",
    "  - Glaxo\n",
    "generator:\n",
    "  batch_size: 40\n",
    "  initial_size: 40\n",
    "  mutation_data: data/smarts/mutation_collection.tsv\n",
    "  initial_data: data/smiles/guacamol_intitial_rediscovery_troglitazone.smi\n",
    "surrogate:\n",
    "  type: Fingerprint\n",
    "  representation: ECFP4\n",
    "acquisition:\n",
    "  type: Mean"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42b75ed1-e897-44d4-8c46-83cf2d55dd60",
   "metadata": {},
   "source": [
    "This standard config file specifies GB_BI run for a maximum of 100 generations or 1000 fitness function calls with an archive containing 150 niches. This is also the maximum amount of molecules that could potentially make up the evolutionary population at any given generation. The archive is spanned by 4 descriptors (`Descriptors.ExactMolWt`, `Descriptors.MolLogP`, `Descriptors.TPSA`, `Crippen.MolMR`) an limited to corresponding ranges indicated in the configuration file. The fitness function is the Tanimoto similarity (standard for all fingerprint representations) applied to ECFP4 fingerprints. The target, Troglitazone, is specified by its SMILES representation. Finally, the molecular representation (`ECFP4 fingerprints`) for the surrogate model and the acquistion function (`Posterior Mean`) are also specified."
   ]
  },
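  {
   "cell_type": "markdown",
   "id": "7c1e5a90-tanimoto-sketch-md",
   "metadata": {},
   "source": [
    "To make the fitness score more concrete, the sketch below computes the Tanimoto similarity of two fingerprints, represented here as plain Python sets of \"on\" bit indices. This is a minimal illustration and not GB-BI's own implementation: the function name `tanimoto_similarity` and the toy bit sets are hypothetical, and real ECFP4 fingerprints would be generated with a cheminformatics toolkit. A score of 1.0 means the two fingerprints are identical, which is what a successful rediscovery looks like."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7c1e5a91-tanimoto-sketch-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "def tanimoto_similarity(bits_a, bits_b):\n",
    "    # Tanimoto similarity of two fingerprints given as sets of on-bit indices\n",
    "    union = len(bits_a | bits_b)\n",
    "    if union == 0:\n",
    "        # Convention chosen for this sketch: two empty fingerprints score 0\n",
    "        return 0.0\n",
    "    return len(bits_a & bits_b) / union\n",
    "\n",
    "\n",
    "# Hypothetical toy fingerprints (not derived from real molecules)\n",
    "target_bits = {1, 4, 7, 9, 12}\n",
    "candidate_bits = {1, 4, 9, 15}\n",
    "\n",
    "# 3 shared bits out of 6 distinct bits\n",
    "print(tanimoto_similarity(target_bits, candidate_bits))\n",
    "# Identical fingerprints\n",
    "print(tanimoto_similarity(target_bits, target_bits))"
   ]
  },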
  {
   "cell_type": "markdown",
   "id": "bf86efde-f693-434d-ae72-787e948c079d",
   "metadata": {},
   "source": [
    "## Interpreting the Output of a GB-BI Optimisation Run ##"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3ea4732a-c2e7-4d80-b2de-b44fe7029c39",
   "metadata": {},
   "source": [
    "In the same way we rely on HYDRA to take care of configurations files, we also use it to autmatically generate date and time-stamped output folders. For GB-BI, these folders contain a series of `archive_i.csv` files where `i` is the generation at the of which the archive was printed to file. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "d1640914-1658-43c2-b969-7436b6c872a3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>acquisition_value</th>\n",
       "      <th>fitness</th>\n",
       "      <th>niche_index</th>\n",
       "      <th>predicted_fitness</th>\n",
       "      <th>predicted_uncertainty</th>\n",
       "      <th>smiles</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.425027</td>\n",
       "      <td>0.448980</td>\n",
       "      <td>3</td>\n",
       "      <td>0.425027</td>\n",
       "      <td>0.000435</td>\n",
       "      <td>O=C1NC(=O)C(Cc2ccc(OCC=C(O)Br)cc2)S1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.365707</td>\n",
       "      <td>0.377358</td>\n",
       "      <td>4</td>\n",
       "      <td>0.365707</td>\n",
       "      <td>0.000271</td>\n",
       "      <td>Cc1c(C)c2c(c(C)c1O)CCC(C)(CN1CCN(C)C1)O2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.423895</td>\n",
       "      <td>0.427273</td>\n",
       "      <td>5</td>\n",
       "      <td>0.423895</td>\n",
       "      <td>0.000272</td>\n",
       "      <td>CN1CCN(C(O)COc2ccc(CC3SC(=O)NC3=O)cc2)CC1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.537205</td>\n",
       "      <td>0.552083</td>\n",
       "      <td>6</td>\n",
       "      <td>0.537205</td>\n",
       "      <td>0.000205</td>\n",
       "      <td>Cc1c(C)c2c(c(C)c1O)CCC(C)(Cc1ccc(Br)cc1)O2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.361091</td>\n",
       "      <td>0.326316</td>\n",
       "      <td>7</td>\n",
       "      <td>0.361091</td>\n",
       "      <td>0.000935</td>\n",
       "      <td>O=C1NC(=O)C(Cc2ccccc2)S1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0.436628</td>\n",
       "      <td>0.440000</td>\n",
       "      <td>8</td>\n",
       "      <td>0.436628</td>\n",
       "      <td>0.000139</td>\n",
       "      <td>CN(Br)N(Br)COc1ccc(CC2SC(=O)NC2=O)cc1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>0.649736</td>\n",
       "      <td>0.652632</td>\n",
       "      <td>10</td>\n",
       "      <td>0.649736</td>\n",
       "      <td>0.000634</td>\n",
       "      <td>Cc1c(C)c2c(c(C)c1O)CCC(C)(CONCC1SC(=O)NC1=O)O2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>0.208211</td>\n",
       "      <td>0.215686</td>\n",
       "      <td>11</td>\n",
       "      <td>0.208211</td>\n",
       "      <td>0.000423</td>\n",
       "      <td>O=C1NC(=O)C(COCC(O)OBr)S1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>0.713624</td>\n",
       "      <td>0.765957</td>\n",
       "      <td>14</td>\n",
       "      <td>0.713624</td>\n",
       "      <td>0.000880</td>\n",
       "      <td>Cc1c(C)c2c(c(C)c1O)CCC(C)(c1ccc(CC3SC(=O)NC3=O...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>0.800924</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>18</td>\n",
       "      <td>0.800924</td>\n",
       "      <td>0.000857</td>\n",
       "      <td>Cc1c(C)c2c(c(C)c1O)CCC(C)(COc1ccc(CC3SC(=O)NC3...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   acquisition_value   fitness  niche_index  predicted_fitness  \\\n",
       "0           0.425027  0.448980            3           0.425027   \n",
       "1           0.365707  0.377358            4           0.365707   \n",
       "2           0.423895  0.427273            5           0.423895   \n",
       "3           0.537205  0.552083            6           0.537205   \n",
       "4           0.361091  0.326316            7           0.361091   \n",
       "5           0.436628  0.440000            8           0.436628   \n",
       "6           0.649736  0.652632           10           0.649736   \n",
       "7           0.208211  0.215686           11           0.208211   \n",
       "8           0.713624  0.765957           14           0.713624   \n",
       "9           0.800924  1.000000           18           0.800924   \n",
       "\n",
       "   predicted_uncertainty                                             smiles  \n",
       "0               0.000435               O=C1NC(=O)C(Cc2ccc(OCC=C(O)Br)cc2)S1  \n",
       "1               0.000271           Cc1c(C)c2c(c(C)c1O)CCC(C)(CN1CCN(C)C1)O2  \n",
       "2               0.000272          CN1CCN(C(O)COc2ccc(CC3SC(=O)NC3=O)cc2)CC1  \n",
       "3               0.000205         Cc1c(C)c2c(c(C)c1O)CCC(C)(Cc1ccc(Br)cc1)O2  \n",
       "4               0.000935                           O=C1NC(=O)C(Cc2ccccc2)S1  \n",
       "5               0.000139              CN(Br)N(Br)COc1ccc(CC2SC(=O)NC2=O)cc1  \n",
       "6               0.000634     Cc1c(C)c2c(c(C)c1O)CCC(C)(CONCC1SC(=O)NC1=O)O2  \n",
       "7               0.000423                          O=C1NC(=O)C(COCC(O)OBr)S1  \n",
       "8               0.000880  Cc1c(C)c2c(c(C)c1O)CCC(C)(c1ccc(CC3SC(=O)NC3=O...  \n",
       "9               0.000857  Cc1c(C)c2c(c(C)c1O)CCC(C)(COc1ccc(CC3SC(=O)NC3...  "
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "archive_0 = pd.read_csv('./Example_Output/archive_100.csv')[['acquisition_value', 'fitness', 'niche_index',\t'predicted_fitness', 'predicted_uncertainty',\t'smiles']]\n",
    "archive_0.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f7b207f-b94b-49de-bd8b-8bc8da108f62",
   "metadata": {},
   "source": [
    "In addition, the folder also contains a `molecules.csv` file and a `statistics.csv` file. These files respectively comprise of all molecules (and their properties) that have been evalauted by the fitness function and the overall statistics of the archive and the GB-BI run at each generation. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "af9c0263-390d-44fe-ac68-d98ade7a89ab",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>generation</th>\n",
       "      <th>maximum fitness</th>\n",
       "      <th>mean fitness</th>\n",
       "      <th>quality diversity score</th>\n",
       "      <th>coverage</th>\n",
       "      <th>function calls</th>\n",
       "      <th>max_err</th>\n",
       "      <th>mse</th>\n",
       "      <th>mae</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0.418803</td>\n",
       "      <td>0.377648</td>\n",
       "      <td>3.398836</td>\n",
       "      <td>6.000000</td>\n",
       "      <td>24</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>0.495935</td>\n",
       "      <td>0.326890</td>\n",
       "      <td>9.806713</td>\n",
       "      <td>20.000000</td>\n",
       "      <td>54</td>\n",
       "      <td>0.189796</td>\n",
       "      <td>0.009939</td>\n",
       "      <td>0.078342</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>0.495935</td>\n",
       "      <td>0.337603</td>\n",
       "      <td>12.153720</td>\n",
       "      <td>24.000000</td>\n",
       "      <td>85</td>\n",
       "      <td>0.076884</td>\n",
       "      <td>0.001355</td>\n",
       "      <td>0.029830</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>0.495935</td>\n",
       "      <td>0.360824</td>\n",
       "      <td>14.432969</td>\n",
       "      <td>26.666667</td>\n",
       "      <td>119</td>\n",
       "      <td>0.060353</td>\n",
       "      <td>0.000754</td>\n",
       "      <td>0.021815</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>0.564815</td>\n",
       "      <td>0.375560</td>\n",
       "      <td>15.773541</td>\n",
       "      <td>28.000000</td>\n",
       "      <td>155</td>\n",
       "      <td>0.106445</td>\n",
       "      <td>0.001486</td>\n",
       "      <td>0.028396</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5</td>\n",
       "      <td>0.592593</td>\n",
       "      <td>0.386288</td>\n",
       "      <td>17.382972</td>\n",
       "      <td>30.000000</td>\n",
       "      <td>190</td>\n",
       "      <td>0.089415</td>\n",
       "      <td>0.001411</td>\n",
       "      <td>0.030500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>6</td>\n",
       "      <td>0.791667</td>\n",
       "      <td>0.397842</td>\n",
       "      <td>19.096398</td>\n",
       "      <td>32.000000</td>\n",
       "      <td>226</td>\n",
       "      <td>0.225988</td>\n",
       "      <td>0.002762</td>\n",
       "      <td>0.032465</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>7</td>\n",
       "      <td>0.817204</td>\n",
       "      <td>0.402366</td>\n",
       "      <td>20.520662</td>\n",
       "      <td>34.000000</td>\n",
       "      <td>263</td>\n",
       "      <td>0.093953</td>\n",
       "      <td>0.000831</td>\n",
       "      <td>0.019793</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>8</td>\n",
       "      <td>0.817204</td>\n",
       "      <td>0.415433</td>\n",
       "      <td>21.602498</td>\n",
       "      <td>34.666667</td>\n",
       "      <td>301</td>\n",
       "      <td>0.091979</td>\n",
       "      <td>0.000716</td>\n",
       "      <td>0.018666</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>9</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.429936</td>\n",
       "      <td>22.356681</td>\n",
       "      <td>34.666667</td>\n",
       "      <td>338</td>\n",
       "      <td>0.199076</td>\n",
       "      <td>0.001700</td>\n",
       "      <td>0.022910</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   generation  maximum fitness  mean fitness  quality diversity score  \\\n",
       "0           0         0.418803      0.377648                 3.398836   \n",
       "1           1         0.495935      0.326890                 9.806713   \n",
       "2           2         0.495935      0.337603                12.153720   \n",
       "3           3         0.495935      0.360824                14.432969   \n",
       "4           4         0.564815      0.375560                15.773541   \n",
       "5           5         0.592593      0.386288                17.382972   \n",
       "6           6         0.791667      0.397842                19.096398   \n",
       "7           7         0.817204      0.402366                20.520662   \n",
       "8           8         0.817204      0.415433                21.602498   \n",
       "9           9         1.000000      0.429936                22.356681   \n",
       "\n",
       "    coverage  function calls   max_err       mse       mae  \n",
       "0   6.000000              24       NaN       NaN       NaN  \n",
       "1  20.000000              54  0.189796  0.009939  0.078342  \n",
       "2  24.000000              85  0.076884  0.001355  0.029830  \n",
       "3  26.666667             119  0.060353  0.000754  0.021815  \n",
       "4  28.000000             155  0.106445  0.001486  0.028396  \n",
       "5  30.000000             190  0.089415  0.001411  0.030500  \n",
       "6  32.000000             226  0.225988  0.002762  0.032465  \n",
       "7  34.000000             263  0.093953  0.000831  0.019793  \n",
       "8  34.666667             301  0.091979  0.000716  0.018666  \n",
       "9  34.666667             338  0.199076  0.001700  0.022910  "
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "statistics = pd.read_csv('./Example_Output/statistics.csv')\n",
    "statistics.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "97f34aff",
   "metadata": {},
   "source": [
    "## Overwriting the Configuration File from the Command Line ##\n",
    "\n",
    "All the settings specified in the standard configuration file can easily be overwritten in the command line. For example, we can change the target to `Tiotixene` and change the acquistion function to the `expected improvement` (EI). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1cc18d8a-4feb-4cba-8174-b01e57de91ac",
   "metadata": {},
   "outputs": [],
   "source": [
    "python GB-BI.py fitness.target=\"O=S(=O)(N(C)C)c2cc1C(\\c3c(Sc1cc2)cccc3)=C/CCN4CCN(C)CC4\" acquisition.type=EI"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4869a3a8",
   "metadata": {},
   "source": [
    "Thanks to the capabilities of HYDRA, we can also easily set up a multirun of the GB-BI algorithm which sweeps throuhg the different combinations of the configurations specified in the command line. For instance, here we start a multirun with `ECFP4` and `FCFP4` as molecular representations of the surrogate model and using the `posterior mean` (Mean) and the `expected improvement` (EI) as acquisition functions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8c6a2fad-efdf-4a6e-bfad-542879f74365",
   "metadata": {},
   "outputs": [],
   "source": [
    "python GB-BI.py hydra.mode=MULTIRUN surrogate.representation=ECFP4,FCFP4 acquisition.type=Mean,EI"
   ]
  },
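  {
   "cell_type": "markdown",
   "id": "3b9d2f10-multirun-sketch-md",
   "metadata": {},
   "source": [
    "Hydra's default sweeper expands comma-separated values into their Cartesian product, so the command above launches four separate GB-BI runs. The sketch below only enumerates those combinations with `itertools.product`; it mimics how the sweep is expanded and does not call Hydra itself. The variable names `sweep` and `jobs` are ours, chosen for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3b9d2f11-multirun-sketch-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import itertools\n",
    "\n",
    "# Sweep values taken from the multirun command above\n",
    "sweep = {\n",
    "    'surrogate.representation': ['ECFP4', 'FCFP4'],\n",
    "    'acquisition.type': ['Mean', 'EI'],\n",
    "}\n",
    "\n",
    "# One job per element of the Cartesian product of all swept values\n",
    "jobs = [dict(zip(sweep, values)) for values in itertools.product(*sweep.values())]\n",
    "\n",
    "for job_number, overrides in enumerate(jobs):\n",
    "    print(job_number, overrides)"
   ]
  },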
  {
   "cell_type": "markdown",
   "id": "74d8d83b",
   "metadata": {},
   "source": [
    "## Components of the Configuration File ##\n",
    "\n",
    "As mentioned before, we will discuss each component of the configuration file. We will go through the different options for each of these components one by one and discuss potential pitfalls. The different components of the configuration file correspond to diffferent classes in the code."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "53282940-cfe6-4e5c-9515-06238bd02350",
   "metadata": {},
   "source": [
    "**Controller:** This class controls the basic overall settings of a GB-BI run. In the configuration file, you can set the maximum amount of generations and the maximum amount of fitness calls alloted to the optimsation process. The GB-BI run will end when either has been reached. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "0826fcff",
   "metadata": {},
   "outputs": [],
   "source": [
    "controller:\n",
    "  max_generations: 100\n",
    "  max_fitness_calls: 1000"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6c4ec758-23d4-4ee6-9644-8773a2ba9d4d",
   "metadata": {},
   "source": [
    "**Archive:** This class controls the quality-diversity archive at the core of GB-BI. The amount of niches is set in this part of the configuration file, but also two more technical aspects. The archive makes use of a centroidal Voronoi tessellation which needs to be sampled at the beginning of each run (with a certain accuracy). To avoid unncessary re-sampling the centroidal Voronoi tessellation can be stored in a file, the name of which is set here. note that the name should be changed if you change the accuracy, dimensionality, or size of the archive. The dimensionality is set in the next part of the configuration file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "868f10a6-1a99-4934-a1d7-1e0e75b329b5",
   "metadata": {},
   "outputs": [],
   "source": [
    "archive:\n",
    "  name: archive_150_4_25000\n",
    "  size: 150\n",
    "  accuracy: 25000"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ec52e35a-ab5b-44be-95fd-d8f7ab71c966",
   "metadata": {},
   "source": [
    "**Descriptor:** This class controls the amount and type of physicochemical properties that are used to describe a molecule and place it in its niche. The amounts of descriptors sets the dimensionality of the archive. When choosing ranges, be careful to make sure your target molecule lies within part of chemical space covered by the archive."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "71068fdf-6b80-4959-ba7b-8a956c99fc1a",
   "metadata": {},
   "outputs": [],
   "source": [
    "descriptor:\n",
    "  properties:\n",
    "  - Descriptors.ExactMolWt\n",
    "  - Descriptors.MolLogP\n",
    "  - Descriptors.TPSA\n",
    "  - Crippen.MolMR\n",
    "  ranges:\n",
    "  - - 225\n",
    "    - 555\n",
    "  - - -0.5\n",
    "    - 5.5\n",
    "  - - 0\n",
    "    - 140\n",
    "  - - 40\n",
    "    - 130"
   ]
  },
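  {
   "cell_type": "markdown",
   "id": "5e8a4c20-range-check-sketch-md",
   "metadata": {},
   "source": [
    "One way to guard against a target that falls outside the archive is a quick range check before starting a run. The sketch below is a minimal illustration in plain Python: the helper `out_of_range` is hypothetical and not part of GB-BI, and the descriptor values are made-up numbers for an arbitrary target rather than values computed for Troglitazone. In practice you would compute them with the RDKit functions named in the configuration file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5e8a4c21-range-check-sketch-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Ranges copied from the descriptor section of the configuration file\n",
    "ranges = {\n",
    "    'Descriptors.ExactMolWt': (225, 555),\n",
    "    'Descriptors.MolLogP': (-0.5, 5.5),\n",
    "    'Descriptors.TPSA': (0, 140),\n",
    "    'Crippen.MolMR': (40, 130),\n",
    "}\n",
    "\n",
    "\n",
    "def out_of_range(values, ranges):\n",
    "    # Return the names of all descriptors that fall outside their configured range\n",
    "    return [name for name, (low, high) in ranges.items()\n",
    "            if not low <= values[name] <= high]\n",
    "\n",
    "\n",
    "# Hypothetical descriptor values for some target molecule (illustration only)\n",
    "target_values = {\n",
    "    'Descriptors.ExactMolWt': 441.2,\n",
    "    'Descriptors.MolLogP': 4.4,\n",
    "    'Descriptors.TPSA': 85.0,\n",
    "    'Crippen.MolMR': 135.0,\n",
    "}\n",
    "\n",
    "# An empty list means the target lies inside the archive bounds\n",
    "print(out_of_range(target_values, ranges))"
   ]
  },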
  {
   "cell_type": "markdown",
   "id": "4c89e9a9-cf60-46c2-a3fb-2644567f9127",
   "metadata": {},
   "source": [
    "**Fitness:** This class controls the type of fitness function that is used during the GB-BI run. For fingerprint-based rediscovery, the options for the representation are `ECFP4`, `ECFP6`, `FCFP4`, and `FCFP6` in line with the representations supported by the Gaucamol rediscovery benchmarks. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e4249ca9-bad4-4e9f-8e66-729854b88c7e",
   "metadata": {},
   "outputs": [],
   "source": [
    "fitness:\n",
    "  type: Fingerprint\n",
    "  target: \"O=C1NC(=O)SC1Cc4ccc(OCC3(Oc2c(c(c(O)c(c2CC3)C)C)C)C)cc4\"\n",
    "  representation: ECFP4"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "46364584-2b5f-437f-9fb4-40a39e168199",
   "metadata": {},
   "source": [
    "**Arbiter:** This class controls the type of structural filters that are used during the GB-BI run. The options for the rules are `Glaxo`, `Dundee`, `BMS`, `PAINS`, `SureChEMBL`, `MLSMR`, `Inpharmatica`, and `LINT`.  The SMARTS in these rules are determined by Pat Walters, and more information on them can be found in the original scripts. When choosing rules, be careful to make sure your target molecule adheres to the chosen structural filters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "48091e2f-efa5-4b2e-ac17-c97930ee3505",
   "metadata": {},
   "outputs": [],
   "source": [
    "arbiter:\n",
    "  rules:\n",
    "  - Glaxo"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "861b8f28-8b53-4124-8c5c-0a69c6264a65",
   "metadata": {},
   "source": [
    "Note that structural filters can be combined by adding them to configuration file. The combined options for the rules will be read in as a list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "488c8d7c-7773-4bc5-a60a-7d1d4ede9a7b",
   "metadata": {},
   "outputs": [],
   "source": [
    "arbiter:\n",
    "  rules:\n",
    "  - Glaxo\n",
    "  - Dundee\n",
    "  - BMS"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5862957c-9b8f-4424-89d0-ee0d5ad1abe2",
   "metadata": {},
   "source": [
    "**Generator:** This class controls the crossovers and mutations in the GB-BI run. The initial size, which is randomly sampled from the inital data (i.e. a column of smiles), can be set in this part of the configuration file together with the batch size for the rest of the optimisation run. The path to the file with the mutation data (SMARTS) can also be changed here. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5506e63d-1c5e-4838-8889-cddc900d01a1",
   "metadata": {},
   "outputs": [],
   "source": [
    "generator:\n",
    "  batch_size: 40\n",
    "  initial_size: 40\n",
    "  mutation_data: data/smarts/mutation_collection.tsv\n",
    "  initial_data: data/smiles/guacamol_intitial_rediscovery_troglitazone.smi"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bb69bb28-660d-4386-bf7e-f127b7592cf9",
   "metadata": {},
   "source": [
    "**Surrogate:** This class controls the molecular representation used in surrogate fitness model of the GB-BI run. There are two available types: `Fingerprint` and `String`. For `Fingerprint`, the representation options are `ECFP4`, `ECFP6`, `FCFP4`, `FCFP6`, `RDFP`, `APFP`, and`TTFP`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e3c83805-64a3-4dc1-b03a-48ae3a6ef403",
   "metadata": {},
   "outputs": [],
   "source": [
    "surrogate:\n",
    "  type: Fingerprint\n",
    "  representation: ECFP4"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "df9f28c2-c8f9-42b4-8bcc-4b17c714ada2",
   "metadata": {},
   "source": [
    "For `String`, the representation options are `Smiles` and `Selfies`. As these strings will be represented as a bag-of-characters in the Gaussian process, a maximum ngram `max_ngram` also needs to be defined in the configuration file. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "95795fce",
   "metadata": {},
   "outputs": [],
   "source": [
    "surrogate:\n",
    "  type: String\n",
    "  representation: Selfies\n",
    "  max_ngram: 5"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "06be3918-ec5c-47cb-a5f7-aa6445913c17",
   "metadata": {},
   "source": [
    "**Acquisition:** This class controls the type of acquisition function used in the GB-BI run. The options for the type are the posterioir mean `Mean`, the upper confidence bound `UCB`, the expected improvement `EI`, and a numerically stable implementation of the logarithm of the expected improvement `logEI`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "06f43da4-f906-402e-94a6-162ea37c8dcd",
   "metadata": {},
   "outputs": [],
   "source": [
    "acquisition:\n",
    "  type: Mean"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c9a36232-1a2a-470f-b8bb-e00fce1023e7",
   "metadata": {},
   "source": [
    "For the the upper confidence bound `UCB`, an additional hyperparameter `beta` which balances exploitation and exploration needs to be defined as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "48ca7b4d-22f3-4c45-abd2-134856366b78",
   "metadata": {},
   "outputs": [],
   "source": [
    "acquisition:\n",
    "  type: UCB\n",
    "  beta: 0.2"
   ]
  },
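  {
   "cell_type": "markdown",
   "id": "9d4f7b30-acquisition-sketch-md",
   "metadata": {},
   "source": [
    "For intuition, the sketch below writes out common textbook forms of three of these acquisition functions for a single candidate with a Gaussian prediction. It rests on assumptions that are ours rather than taken from the GB-BI code: the surrogate is assumed to return a predictive mean and standard deviation, `UCB` is taken to be `mean + beta * std`, and the function names are hypothetical. GB-BI's own implementation may scale `beta` differently or work with the variance instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9d4f7b31-acquisition-sketch-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "\n",
    "def posterior_mean(mean, std):\n",
    "    # Pure exploitation: the uncertainty is ignored\n",
    "    return mean\n",
    "\n",
    "\n",
    "def upper_confidence_bound(mean, std, beta):\n",
    "    # A larger beta puts more weight on uncertain (less explored) candidates\n",
    "    return mean + beta * std\n",
    "\n",
    "\n",
    "def expected_improvement(mean, std, best_fitness):\n",
    "    # Closed-form expected improvement for a Gaussian prediction when maximising\n",
    "    if std <= 0.0:\n",
    "        return max(mean - best_fitness, 0.0)\n",
    "    z = (mean - best_fitness) / std\n",
    "    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)\n",
    "    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))\n",
    "    return (mean - best_fitness) * cdf + std * pdf\n",
    "\n",
    "\n",
    "# Hypothetical prediction for one candidate molecule\n",
    "mean, std = 0.70, 0.05\n",
    "print(posterior_mean(mean, std))\n",
    "print(upper_confidence_bound(mean, std, beta=0.2))\n",
    "print(expected_improvement(mean, std, best_fitness=0.65))"
   ]
  },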
  {
   "cell_type": "markdown",
   "id": "2bef39f2-4477-4a05-92e7-cfce61df6953",
   "metadata": {},
   "source": [
    "## References\n",
    "\n",
    "* Griffiths, Ryan-Rhys, et al. \"Gauche: A library for Gaussian processes in chemistry.\" Advances in Neural Information Processing Systems 36 (2024).\n",
    "  \n",
    "* Verhellen, Jonas, and Jeriek Van den Abeele. \"Illuminating elite patches of chemical space.\" Chemical science 11.42 (2020): 11485-11491.\n",
    "  \n",
    "* Brown, Nathan, et al. \"GuacaMol: benchmarking models for de novo molecular design.\" Journal of chemical information and modeling 59.3 (2019): 1096-1108.\n",
    "\n",
    "* Jensen, Jan H. \"A graph-based genetic algorithm and generative model/Monte Carlo tree search for the exploration of chemical space.\" Chemical science 10.12 (2019): 3567-3572."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}